CRT 30 Years Later 
Continued from p. 20 


proficient, and advanced—to report 
levels of achievement. Will each of 
these two shifts add to the CR in- 
terpretability of NAEP results? The 
answers are no and no. 

Performance assessment works 
against high item density. Such 
tasks are often time consuming, 
and therefore fewer of them can be 
administered. Such tasks usually 
measure a variety of skills and, cor- 
respondingly, yield low correlations 
between tasks. The resulting com- 
bination of low item density and 
task specificity precludes CR inter- 
pretations. 

Determining that a student is at 
the proficient level in a subject— 
reading, for example—does not give 
us a confident assessment of the 
specific tasks a student can or can- 
not do. Although the achievement 
levels are defined in terms of groups 
of “elusive constructs,” illustrated 
above, performance on tasks mea- 
suring such constructs can span the 
achievement scale. 

With the shift of the NAEP con- 
tract from the Education Commis- 
sion of the States to the Educational 
Testing Service came use of item re- 
sponse theory in scale development 
and reporting. Item response theory 
(IRT) technology doesn’t automati- 
cally infuse interpretability into the 
NAEP scales. IRT can help increase 
interpretability if the instruments 
have high item density. IRT scales 
have the best CR interpretability 
when they contain both the ability 
levels of subjects (students, schools, 
etc.) and the difficulty levels of a ho- 
mogeneous set of items. Then they 
come closer to what Bob Glaser 
wanted of a CR interpretation. He 
wrote: 


Underlying the concept of 
achievement measurement is the 
notion of a continuum of knowl- 
edge acquisition ranging from no 
proficiency at all to perfect perfor- 
mance. An individual’s achieve- 
ment level falls at some point on 
this continuum as indicated by 
the behaviors he displays during 
testing. . . . the specific behaviors 
implied at each level of profi- 
ciency can be identified. . . . Crite- 
rion-referenced measures indicate 
the content of the behavioral 
repertory, and the correspondence 
between what an individual does 
and the underlying continuum of 
achievement. (pp. 519-520) 
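
The idea of a scale that holds both ability levels and item difficulties can be made concrete with a minimal sketch. It assumes the one-parameter logistic (Rasch) model purely for illustration; it is not presented as the model used to scale NAEP, and the function name `p_correct`, the item labels, and the difficulty values are all hypothetical. Under this assumption, a student whose ability equals an item's difficulty has a 0.5 chance of answering it correctly, so a scale score can be read against the items located near it.

```python
import math


def p_correct(theta: float, b: float) -> float:
    """Probability of a correct response under the Rasch model.

    theta: ability of the student, on the common scale (assumed value).
    b: difficulty of the item, on the same scale (assumed value).
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Hypothetical difficulties for a homogeneous set of items.
ITEM_DIFFICULTIES = {
    "easy item": -1.0,
    "medium item": 0.0,
    "hard item": 1.5,
}


def expected_performance(theta: float, items: dict) -> dict:
    """Map each item name to the probability that a student at theta succeeds."""
    return {name: p_correct(theta, b) for name, b in items.items()}


if __name__ == "__main__":
    # A student located at 0.0 on the scale: likely to succeed on items
    # below that point, unlikely to succeed on items well above it.
    for name, p in expected_performance(0.0, ITEM_DIFFICULTIES).items():
        print(f"{name}: {p:.2f}")
```

Such a reading is only as informative as the items are numerous and homogeneous: with few items near a given scale point, there is little to say about what a student at that point can and cannot do.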


Older and Wiser 


Now that CR testing has 30 years of 
experience behind it, we can reflect 
on its capabilities with less bright- 
eyed optimism and more realism. 
Part of that realism is that only oc- 
casionally are tests desired that, be- 
cause of their restricted domain of 
coverage and high item saturation, 
are capable of supplying CR inter- 
pretations. But when they are de- 
sired, care is required in the choice 
of tasks. Test makers can profit 
from guidelines about task con- 
struction and selection. Their 
awareness of the underlying cogni- 
tive demands for each task can facil- 
itate both the choice of tasks and the 
interpretation of performance. CR 
interpretation is also aided by scales 
that show ability and task difficulty 
together and by interpretations that 
employ constructs closer to the item 
level. But most of all, to capture 
Glaser’s original intention that 
CRTs be capable of telling us what 
students can and cannot do, we need 
higher item density—more items 
per acre of domain. 


Note 

This is a revision of a paper presented 
as part of the symposium, “Criterion-Referenced 
Measurement: A 30-Year Retrospective,” at the 
Annual Meetings of the American Educational 
Research Association and the National Council 
on Measurement in Education in Atlanta, GA, 
April 1993. I wish to thank the editor and 
reviewers for many helpful suggestions with 
respect to the original paper. 